Failed autonomous investigations by customer - High volume

Monitor - https://app.axiom.co/legion-security-ry4e/monitors/view/EL5lIqRzOcqEWUYlmK

Overview

This monitor catches problems for customers who run a high volume of autonomous investigations. It counts how many of those investigations failed within a short time window and triggers an alert when that count reaches or exceeds the threshold.

Because the time window is short, an alert from this monitor usually means the problem started recently.

Finding the problem

  1. Run the monitor's query to list the customers whose number of failed investigations is at or above the defined threshold:

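    traces
    // Only the endpoint that terminates an investigation, in production
    | where name == "POST /investigations/{investigation_id}/terminate"
    | where ['resource.deployment.environment'] == "production"
    // Investigations whose new status was set to "failed"
    | where ['attributes.new_status'] == "failed"
    // Exclude spans where the terminate request itself errored
    | where ['status.code'] != "ERROR"
    // Count distinct failed investigations per customer
    | summarize FailedRuns=dcount(['attributes.investigation_id']) by ['attributes.org_id'], org_name
    | where FailedRuns >= 3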
  2. Continue with the same next steps as for Failed autonomous investigations by customer - Low volume.
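
To make the query's aggregation easier to reason about, here is a minimal Python sketch of the same logic applied to in-memory span records. It is an illustration only, not how the monitor is implemented. The helper names (`make_span`, `orgs_over_threshold`), the sample organizations, and the `"OK"` status value are hypothetical; the field names and the threshold of 3 are taken from the query above. A Python set gives an exact distinct count, whereas `dcount` in Kusto-style query languages is typically an estimate, so results may differ slightly at very high volumes.

```python
from collections import defaultdict

TERMINATE_SPAN = "POST /investigations/{investigation_id}/terminate"
THRESHOLD = 3  # mirrors `where FailedRuns >= 3` in the query


def make_span(org_id, org_name, investigation_id,
              new_status="failed", status_code="OK"):
    """Build a hypothetical span record using the query's field names."""
    return {
        "name": TERMINATE_SPAN,
        "resource.deployment.environment": "production",
        "attributes.new_status": new_status,
        "status.code": status_code,  # "OK" is an assumed non-error value
        "attributes.org_id": org_id,
        "org_name": org_name,
        "attributes.investigation_id": investigation_id,
    }


def orgs_over_threshold(spans, threshold=THRESHOLD):
    """Return {(org_id, org_name): distinct failed investigations}
    for every customer at or above the threshold."""
    failed_ids = defaultdict(set)
    for span in spans:
        # Same four filters as the `where` clauses in the query.
        if span.get("name") != TERMINATE_SPAN:
            continue
        if span.get("resource.deployment.environment") != "production":
            continue
        if span.get("attributes.new_status") != "failed":
            continue
        if span.get("status.code") == "ERROR":
            continue
        key = (span.get("attributes.org_id"), span.get("org_name"))
        # A set deduplicates repeated spans for the same investigation.
        failed_ids[key].add(span.get("attributes.investigation_id"))
    return {k: len(v) for k, v in failed_ids.items() if len(v) >= threshold}


sample = [
    make_span("org-a", "Acme", "inv-1"),
    make_span("org-a", "Acme", "inv-2"),
    make_span("org-a", "Acme", "inv-3"),
    make_span("org-a", "Acme", "inv-1"),  # duplicate, counted once
    make_span("org-b", "Globex", "inv-9"),
    make_span("org-b", "Globex", "inv-10"),
    make_span("org-b", "Globex", "inv-11", new_status="completed"),
    make_span("org-b", "Globex", "inv-12", status_code="ERROR"),
]

result = orgs_over_threshold(sample)
print(result)
```

In this sample only the first customer is returned: the duplicate span for `inv-1` does not inflate the count, and the second customer stays below the threshold because one of its spans is not a failure and another is excluded by the `status.code` filter.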